ON CLUSTERING TECHNIQUES OF
CITATION GRAPHS

F.  P.  Preparata  and  R.  T.  Chien 

ABSTRACT 

In this paper we report results in the application of graph
theory to the problem of clustering in document retrieval systems using
bibliographic coupling devices. The problem is attacked by mapping the
citation graph of the document collection onto a unidimensional storage
array. The figure of merit of the location assignment is the total
distance between connected pairs of documents, or, equivalently, the
"stretching" resulting from the mapping. This is the objective function
of the problem. An algorithm is then presented for the reduction of the
objective function, which provides a continually improving solution. Its
computational complexity grows only as $N^{3/2}$, where N is the
collection size.


I.  Introduction 


The problem of organizing a large universe of objects with the
purpose of identifying sets, in such a fashion that objects within a set
are similar to each other but dissimilar from objects outside the set, has
received considerable attention over the past years [1,2,3] as a
fundamental topic in the theory of classification. As has been observed
in many of the mentioned works, however, the vagueness of terms such as
"similar" and "dissimilar", or, equivalently, the qualitative nature of
the relations existing among the objects of the universe, has largely
prevented the use of a mathematical framework in the modeling of the
problem. Yet the notions of similarity and dissimilarity are quite
primitive in our semantics and, therefore, organizational criteria
inspired by these concepts appear quite natural for large universes of
elements.

The above mentioned qualitative nature of the interrelations
among objects is also reflected by the adoption of the term "cluster" in
lieu of set, thus implying the intuitive identification of some "core"
along with some "fuzziness" in the definition of the boundaries of such
sets (*).

The "clustering problem" is definitely central in information
retrieval, particularly in document retrieval, with reference both to
document classification and to automatic indexing (see, e.g.^). The
clustering techniques proposed heretofore [1,2,3] are based on some
reasonably defined concept of "cohesion" among members of the document
cluster.

*A  closely  related  concept,  in  fact,  is  that  of  "fuzzy  set",  proposed  by 
Zadeh'1'  with  reference  to  a  universe  whose  elements  have  various  degrees  of 
membership  in  several  sets  of  the  universe. 
Once a quantitative value is given to the pairwise association among
documents (for example, based on the number of common keywords), the
universe is represented as an undirected graph (undirected because the
association between two documents is obviously reciprocal), whose nodes
represent documents and whose weighted edges represent document
associations. The reader is referred to [1,2,3] for a detailed
discussion of different clustering techniques, all based however on the
criterion of assigning a document to the cluster with which it has the
highest "global" association. It suffices here to point out that from a
computational point of view the proposed methods are characterized by the
fact that the effort required grows roughly with the square of the
collection size (as one would intuitively expect from methods entirely
based on matrix algorithms).

The approach we present in this paper, although closely germane
to those mentioned above, draws its immediate motivation from a rather
important problem which typically manifests itself in computer-based
document retrieval systems. The complexity of an information retrieval
task (processing of a query against a file) depends largely on the physical
location of the documents in the file. Quite generally, in a
computer-based system the processing time is a monotone increasing
function of the total time necessary to access the required items from the
computer storage; each individual access time is in turn a monotone
nondecreasing function of the relative distance, in the memory structure,
of each pair of items sequentially accessed. From this general remark, it
appears quite desirable to locate physically close in the memory structure
items that are likely to be wanted together (for example, in the same
cylinder of a disc file or in the same strip of a magnetic strip file).

This aspect of a computer-based system becomes dominant when
the interrelation among documents is expressed by the relation of citation
between a source document and a reference document. In this case, in
fact, the search algorithm itself proceeds along paths of a graph, and a
means to improve the system's performance is to store at a small physical
distance from each other documents which are close, in some intuitively
acceptable sense, in the collection graph (namely, a "citation" graph).
Therefore, the existence of a citation link between a pair of documents is
taken as a sign of similarity, or, equivalently, of the likelihood of their
being wanted together.

In the sequel we discuss a method which is aimed at the
identification of sets of documents which are "close" in the citation
graph. This is done by mapping the graph onto a unidimensional array and
by successively rearranging the locations assigned to documents. The
criterion governing the location assignment is the reduction of the
"stretching" of graph links as produced by the mapping. This is equivalent
to the reduction of the total stretching (objective function) and will, on
the average, bring to close-by locations in the array documents which are
close in the graph. The presented algorithm is effective in the sense that
only reductions of the total stretching are produced; in a slightly
modified version, it is efficient since its complexity from a
computational standpoint grows only as $N^{3/2}$, where N is the size of
the collection.
It is interesting to notice that the objective function of the
problem is monotonically non-increasing as the proposed algorithm proceeds:
hence, depending upon considerations of policy or of diminishing return,
processing may be stopped at any point, the resulting configuration being
certainly not worse than the initial one.

II. Definitions and Problem Statement

A collection of N documents $\{d_1, d_2, \ldots, d_N\}$ is described
as an undirected citation graph $G$, the nodes of which are the documents.
An edge $\lambda_{hk}$ between $d_h$ and $d_k$ exists if and only if
either 1) "$d_h$ cites $d_k$" or 2) "$d_h$ is cited by $d_k$". Hence $G$
is completely described by its $N \times N$ connection matrix
$B = \|b_{hk}\|$, where $b_{hk} = b_{kh} = 1$ if and only if either 1) or
2) holds (by this we also imply that all citation edges have the same
weight).

With U we denote a unidimensional array of N cells {1, 2, ..., N},
which can be pictured as the set of points with positive integral abscissa
on the line segment [1, N].


An assignment A of $G$ is a mapping of $G$ onto U such that if
$(j_1, j_2, \ldots, j_N)$ is a permutation of the integers (1, 2, ..., N),
document $d_{j_i}$ is assigned to cell i ($d_{j_i} \to i$).

Given a generic assignment A, assume $b_{hk} = 1$ and that
$d_h \to i_h$, $d_k \to i_k$. The quantity $s_{hk} = |i_h - i_k|$ is termed
the relative stretching of edge $\lambda_{hk}$ under the assignment A.
Hence, for each assignment the quantity

$$ S = \frac{1}{2} \sum_{h,k=1}^{N} b_{hk}\, s_{hk} $$

termed the total relative stretching, is perfectly defined and computable.
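The definition of S can be illustrated with a minimal Python sketch. The 4-document collection, the dictionary layout and the function name `total_stretching` are assumptions made for illustration and do not come from the paper; each undirected edge is listed once, so the factor 1/2 of the matrix form is not needed.

```python
def total_stretching(edges, cell_of):
    """Sum |i_h - i_k| over undirected links, each listed once as (h, k)."""
    return sum(abs(cell_of[h] - cell_of[k]) for h, k in edges)

# Hypothetical 4-document collection (not the graph of Fig. 1).
edges = [(1, 2), (1, 4), (2, 3)]
identity = {1: 1, 2: 2, 3: 3, 4: 4}      # assignment d_j -> j
shifted = {1: 2, 2: 3, 3: 4, 4: 1}       # document 4 moved next to document 1

print(total_stretching(edges, identity))  # 1 + 3 + 1
print(total_stretching(edges, shifted))   # 1 + 1 + 1
```

Moving document 4 next to document 1 shortens the only long link, which is the effect the objective function is meant to reward.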


At this point it is convenient to introduce some functions which
are defined on the set of cells of U. To avoid confusion, the value that
a function f takes at a cell j will be indicated with $f^j$ (superscripted).

Let $d_{j_i} \to i$ under an assignment A. Further let $n_{j_i}$ be
the degree of $d_{j_i}$ in $G$, i.e. the number of documents directly
connected to $d_{j_i}$. Of these, assume that under A, $r^i$ have been
assigned cells whose markings are greater than i and $l^i$ have been
assigned cells whose markings are smaller than i; therefore

$$ n_{j_i} = r^i + l^i $$

In other words, $r^i$, $l^i$ are the numbers of "stretched" edges emanating
from i and going respectively to the right and to the left of i, if cells
1, 2, ..., N are arranged in natural order from left to right.

For each cell i we introduce the incremental function $s^i$

$$ s^i = r^i - l^i \tag{1} $$

and the cumulative function

$$ f^i = \sum_{j=1}^{i} s^j \tag{2} $$


We notice in passing that $f^i$ gives the number of links that are
intercepted by an ideal section between i and i + 1. In fact

$$ f^i = \sum_{j=1}^{i} s^j = \sum_{j=1}^{i} r^j - \sum_{j=1}^{i} l^j $$

which shows that $f^i$ equals the number of links going to the right from
cells 1, 2, ..., i minus the subset of these which terminate on cells of
the set 2, 3, ..., i, which confirms our assertion. Further

$$ S = \sum_{i=1}^{N} f^i \tag{3} $$


In fact, the stretching $|h - k|$ of a link between cells h and k can be
thought of as giving a unit contribution to $f^h, f^{h+1}, \ldots, f^{k-1}$
(in the case that h < k). From this observation, and the remark that
$f^N = 0$ (no links are present on the right of cell N), relation (3)
follows immediately.
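Relations (1) to (3) can be checked numerically with a short sketch. The sample collection and the helper name `cut_function` are hypothetical; the function accumulates $s^i = r^i - l^i$ and returns the list $[f^1, \ldots, f^N]$.

```python
from itertools import accumulate

def cut_function(edges, cell_of, n):
    """Return [f^1, ..., f^N]; f^i counts links crossing between cells i and i+1."""
    s = [0] * (n + 1)                 # s[i] = r^i - l^i, cells are 1-based
    for h, k in edges:
        lo, hi = sorted((cell_of[h], cell_of[k]))
        s[lo] += 1                    # the link leaves cell lo towards the right
        s[hi] -= 1                    # the link reaches cell hi from the left
    return list(accumulate(s[1:]))

# Hypothetical 4-document collection under the identity assignment.
edges = [(1, 2), (1, 4), (2, 3)]
identity = {1: 1, 2: 2, 3: 3, 4: 4}
f = cut_function(edges, identity, 4)
S = sum(abs(identity[h] - identity[k]) for h, k in edges)
print(f, sum(f), S)                   # sum(f) and S agree, as relation (3) states
```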


Example: Assume that the undirected graph of Figure 1 is given.


Fig. 1 - A Citation Graph

If now $d_j$ is mapped into cell j of a unidimensional array U we have the
following assignment (Figure 2).


Fig. 2 - Initial Assignment $G \to U$

For this initial assignment we have a total stretching S = 56.

Let us now consider a generic node $d_{j_i}$, such that under A
$d_{j_i} \to i$. Let $d_{j_i}$ be connected to s documents which, under A,
are assigned to cells $h_1 < h_2 < \cdots < h_s$. We define the potential
function of $d_{j_i}$ as

$$ \varphi_i^j = \sum_{r=1}^{s} |j - h_r| $$

In other words $\varphi_i^j$ gives the sum of the relative stretchings of
all the links connected to $d_{j_i}$ if $d_{j_i}$ is placed in cell j
without affecting the assignment of any other document. We remark that for
a given assignment A:

a) for any cell j such that $h_r < j < h_{r+1}$ the increment of
$\varphi_i^j$ is given by

$$ \varphi_i^{j+1} - \varphi_i^j = r - (s - r) $$

i.e. the difference between the number of links going to cells
$h_1, \ldots, h_r$ and the number of links going to cells
$h_{r+1}, \ldots, h_s$. In fact the displacement of $d_{j_i}$ one position
to the right causes the stretchings of the links connected to the former
set to increase by one unit, while the ones pertaining to the latter set
are decreased by one unit. Hence in the interval $h_r < j < h_{r+1}$,
$\varphi_i^j$ is a linear function whose increment is 2r - s (constant in
the interval).

b) at each cell $h_1, h_2, \ldots, h_s$ the increment of $\varphi_i^j$ for
increasing j undergoes a positive discontinuity of 2, since at each such
cell one link passes from the right set to the left set.

Hence $\varphi_i^j$ decreases if 2r - s < 0, increases if 2r - s > 0.
Remarks a) and b) can be combined in the following proposition.


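Proposition - If $d_{j_i}$ is connected to s documents assigned to cells
$h_1 < h_2 < \cdots < h_s$, the function $\varphi_i^j$ is a convex
piecewise linear function which attains its minimum at:

$$ h_{s/2} \le j \le h_{s/2+1} \quad \text{if } s \text{ is even} $$

$$ j = h_{(s+1)/2} \quad \text{if } s \text{ is odd} $$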
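The proposition says that the potential is minimized at a median of the neighbor cells. A small sketch, with hypothetical function names and sample cells, compares the predicted interval with a brute-force scan of all cells.

```python
def potential(j, neighbor_cells):
    """phi^j: total stretching of the links of one document if it sat in cell j."""
    return sum(abs(j - h) for h in neighbor_cells)

def minimizing_interval(neighbor_cells):
    """Return (low, high): every cell j with low <= j <= high minimizes the potential."""
    h = sorted(neighbor_cells)
    s = len(h)
    if s % 2 == 1:
        mid = h[(s - 1) // 2]          # h_{(s+1)/2} in the 1-based notation of the text
        return mid, mid
    return h[s // 2 - 1], h[s // 2]    # h_{s/2} and h_{s/2+1}

def brute_force_minimizers(neighbor_cells, n):
    """Scan cells 1..n and return those where the potential is smallest."""
    values = {j: potential(j, neighbor_cells) for j in range(1, n + 1)}
    best = min(values.values())
    return [j for j in values if values[j] == best]

print(minimizing_interval([2, 5, 9, 10]), brute_force_minimizers([2, 5, 9, 10], 12))
print(minimizing_interval([2, 5, 9]), brute_force_minimizers([2, 5, 9], 12))
```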


As a second remark, following directly from the definitions, we
have that

$$ S = \frac{1}{2} \sum_{i=1}^{N} \varphi_i^i $$


Finally we consider the problem of generating a new assignment A'
from a given assignment A. The basic operation we shall use to this end is
the right cyclic permutation: if $(d_{j_r}, d_{j_{r+1}}, \ldots, d_{j_s})$
are assigned to (r, r+1, ..., s) respectively, after performing the cyclic
permutation (s | r) they will be assigned to (r+1, r+2, ..., s, r)
respectively. Obviously, any assignment can be obtained from any other
assignment through a finite number of right cyclic permutations: in fact
any assignment is a permutation, each permutation is equivalent to a finite
number of transpositions, and each transposition is equivalent to a finite
number of right cyclic permutations (RCP).
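A right cyclic permutation is easy to state on a list. The sketch below assumes a 1-based layout in which index 0 is unused padding and `layout[j]` names the document held in cell j; the function name `rcp` and the letter labels are illustrative only.

```python
def rcp(layout, s, r):
    """Apply the right cyclic permutation (s | r), with r <= s, to a 1-based layout."""
    out = list(layout)
    out[r + 1:s + 1] = layout[r:s]    # dislocation: cells r..s-1 move one step right
    out[r] = layout[s]                # insertion: the document of cell s goes to cell r
    return out

layout = [None, "a", "b", "c", "d", "e"]
print(rcp(layout, 4, 2))              # "d" moves to cell 2; "b" and "c" shift right
```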

It is now of interest to find an expression for the change of S
determined by an RCP. We first notice that an RCP (s | r) results from
the successive performance of the dislocation
($d_{j_r} \to r+1$, $d_{j_{r+1}} \to r+2$, ..., $d_{j_{s-1}} \to s$) and of
the insertion $d_{j_s} \to r$. Let us examine separately the effect of
these two operations on S.

Consider the dislocation and the following sets of links:

$T_{rs}$ = set of links from (1, 2, ..., r-1) to (s, s+1, ..., N)

$U_{rs}$ = set of links from (1, 2, ..., r-1) to (r, r+1, ..., s-1)

$V_{rs}$ = set of links from (r, r+1, ..., s-1) to (s, s+1, ..., N)

Let $t_{rs}$, $u_{rs}$, $v_{rs}$ be the cardinalities of $T_{rs}$,
$U_{rs}$, $V_{rs}$ respectively. The dislocation does not affect the
stretchings of the links of $T_{rs}$, while the stretchings of all links of
$U_{rs}$ are increased by one unit and those of all links of $V_{rs}$ are
decreased by one unit. Hence the change $\Delta_1 S$ of S due to the
dislocation alone is

$$ \Delta_1 S = u_{rs} - v_{rs} $$
But since

$$ f^{r-1} = u_{rs} + t_{rs}, \qquad f^{s-1} = v_{rs} + t_{rs} $$

we have

$$ \Delta_1 S = f^{r-1} - f^{s-1} $$


Consider next the insertion $d_{j_s} \to r$. The change $\Delta_2 S$ of S
due to this operation alone would be exactly
$\varphi_s^r - \varphi_s^s$ if the dislocation did not take place. A
correction is therefore necessary. Specifically the stretching of each
link from s to (r, r+1, ..., s-1) appears reduced by 1 in $\Delta_1 S$,
while it must actually increase by 1 by effect of the RCP. Hence, if there
are $v_{sr}$ such links the total change of S is

$$ \Delta S = \Delta_1 S + \Delta_2 S + 2 v_{sr}
   = (f^{r-1} + \varphi_s^r) - (f^{s-1} + \varphi_s^s) + 2 v_{sr} $$

This is summarized by the following theorem.

Theorem - The change $\Delta S_{sr}$ of S determined by the RCP (s | r) is

$$ \Delta S_{sr} = (f^{r-1} + \varphi_s^r) - (f^{s-1} + \varphi_s^s) + 2 v_{sr} \tag{4} $$

where $v_{sr}$ is the number of links from s to (r, r+1, ..., s-1).
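Relation (4) can be compared with a direct recomputation of S before and after the permutation. The following sketch is a check under stated assumptions rather than the authors' implementation: the layout convention (index 0 unused), the sample graph and all function names are hypothetical, and the potential uses the neighbor cells of the current assignment, as in its definition.

```python
def stretching(edges, layout):
    pos = {doc: cell for cell, doc in enumerate(layout) if cell > 0}
    return sum(abs(pos[a] - pos[b]) for a, b in edges)

def rcp(layout, s, r):
    out = list(layout)
    out[r:s + 1] = [layout[s]] + layout[r:s]
    return out

def delta_s(edges, layout, s, r):
    """Change of S caused by the RCP (s | r), evaluated through relation (4)."""
    pos = {doc: cell for cell, doc in enumerate(layout) if cell > 0}
    links = [tuple(sorted((pos[a], pos[b]))) for a, b in edges]

    def f(i):                          # links cut by a section between cells i and i+1
        return sum(1 for lo, hi in links if lo <= i < hi)

    # Cells linked to the document currently stored in cell s.
    nbrs = [lo if hi == s else hi for lo, hi in links if s in (lo, hi)]

    def phi(j):                        # potential of that document at cell j
        return sum(abs(j - h) for h in nbrs)

    v_sr = sum(1 for h in nbrs if r <= h < s)
    return (f(r - 1) + phi(r)) - (f(s - 1) + phi(s)) + 2 * v_sr

# Hypothetical 5-document graph and a scrambled starting layout.
edges = [(1, 2), (1, 5), (2, 4), (3, 4), (4, 5)]
layout = [None, 3, 5, 1, 4, 2]
mismatches = [
    (s, r)
    for s in range(2, 6)
    for r in range(1, s)
    if delta_s(edges, layout, s, r)
    != stretching(edges, rcp(layout, s, r)) - stretching(edges, layout)
]
print(mismatches)                      # an empty list means (4) matched every RCP
```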

After performing an RCP (s | r) the values of $f^j$ are modified.
This modification, however, affects $f^j$ only for $r \le j < s$.
Specifically let $v_{sr} = h$ and let s be linked to
$i_1, i_2, \ldots, i_h$ with $r \le i_1 < i_2 < \cdots < i_h < s$.
Denote with $f^{\prime j}$ the values of $f^j$ after performing (s | r).
Then we have the following relations:

$$
\begin{aligned}
f^{\prime j} &= f^{j} && 1 \le j < r \ \text{and}\ s \le j \le N \\
f^{\prime j} &= f^{j-1} + (f^{s} - f^{s-1}) + 2h && r \le j \le i_1 \\
f^{\prime j} &= f^{j-1} + (f^{s} - f^{s-1}) + 2(h-m) && i_m < j \le i_{m+1} \quad (m = 1, 2, \ldots, h-1) \\
f^{\prime j} &= f^{j-1} + (f^{s} - f^{s-1}) && i_h < j < s
\end{aligned}
\tag{5}
$$
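The update of $f^j$ after an RCP can be sketched and compared with a full recomputation. In this illustration (hypothetical names and sample graph) the list `f` carries a leading `f[0] = 0`, and the term counting the linked cells at or to the right of j plays the role of h, h - m or 0 in the three cases of the update relations.

```python
def cut_function(edges, layout):
    """Return f with f[0] = 0 and f[j] = f^j for a 1-based layout (index 0 unused)."""
    pos = {doc: cell for cell, doc in enumerate(layout) if cell > 0}
    f = [0] * len(layout)
    for a, b in edges:
        lo, hi = sorted((pos[a], pos[b]))
        for i in range(lo, hi):
            f[i] += 1
    return f

def update_cut_function(f, s, r, linked_cells):
    """Return f' after the RCP (s | r); linked_cells are the cells in r..s-1 linked to s."""
    step = f[s] - f[s - 1]             # s^s = r^s - l^s of the document being moved
    new_f = list(f)                    # entries below r and from s onwards are unchanged
    for j in range(r, s):
        still_right = sum(1 for i in linked_cells if i >= j)
        new_f[j] = f[j - 1] + step + 2 * still_right
    return new_f

# Hypothetical example: identity layout, RCP (4 | 2); cell 4 is linked to cells 2 and 3.
edges = [(1, 2), (1, 5), (2, 4), (3, 4), (4, 5)]
before = [None, 1, 2, 3, 4, 5]
after = [None, 1, 4, 2, 3, 5]          # result of applying (4 | 2) to `before`
updated = update_cut_function(cut_function(edges, before), 4, 2, [2, 3])
print(updated, cut_function(edges, after))
```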


We have now all the necessary tools for the development of an
algorithm aimed at the reduction of the function S. In fact, if the function
$f^j$ is known we can rapidly ascertain from (4) whether a proposed RCP
(s | r) will result in a net decrease of S: $\Delta S < 0$ will be assumed
as the decision rule for its execution. Secondly, the function $f^j$ can be
updated in a relatively simple manner with the aid of relations (5). The
algorithm is presented in the next section.

III. An Algorithm for the Reduction of S

Given an assignment A of $G$, i.e. a mapping $d_{j_i} \to i$ where
$d_{j_i}$ is a document of the collection and i a cell of U, we construct
the following tables of N entries:

a) $T_1$ - Its i-th entry contains the cell of U in which $d_i$ is
currently stored.

b) $T_2$ - Its j-th entry contains the identification of the document
currently stored in cell j of U (i.e. $T_2$ is the inverse of $T_1$).

c) $T_3$ - Its j-th entry contains the current value of $f^j$.

d) $T_4$ - Its h-th entry contains the list of all documents linked in
$G$ to $d_h$.
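The four tables map naturally onto plain lists. The sketch below is one possible construction, assuming documents and cells are both numbered 1..N and index 0 of every list is unused padding; the function name `build_tables` and the sample data are hypothetical.

```python
from itertools import accumulate

def build_tables(edges, doc_in_cell):
    """Build T1..T4 from a 1-based list doc_in_cell (index 0 is padding)."""
    n = len(doc_in_cell) - 1
    t2 = list(doc_in_cell)                     # T2: cell -> document
    t1 = [0] * (n + 1)                         # T1: document -> cell (inverse of T2)
    for cell in range(1, n + 1):
        t1[t2[cell]] = cell
    t4 = [[] for _ in range(n + 1)]            # T4: document -> linked documents
    s = [0] * (n + 1)                          # incremental function s^j
    for a, b in edges:
        t4[a].append(b)
        t4[b].append(a)
        lo, hi = sorted((t1[a], t1[b]))
        s[lo] += 1
        s[hi] -= 1
    t3 = list(accumulate(s))                   # T3: t3[j] = f^j, with t3[0] = 0
    return t1, t2, t3, t4

# Hypothetical collection: cell 1 holds d2, cell 2 holds d1, cell 3 holds d4, cell 4 holds d3.
edges = [(1, 2), (1, 4), (2, 3)]
t1, t2, t3, t4 = build_tables(edges, [0, 2, 1, 4, 3])
print(t1, t3, t4[1])
```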

With  the  aid  of  these  four  tables  we  can  now  give  the  following 
algorithm  for  the  reduction  of  the  total  relative  stretching  S. 

Algorithm 1

1. - Set j = 2.

2. - Access the j-th entry of $T_2$: let this be $d_h$.

3. - Access the h-th entry of $T_4$ and obtain all documents linked
to $d_h$.

4. - Obtain from $T_1$ the cells in which the documents obtained
in step 3 are stored.

5. - With the aid of $T_3$, compute
$\psi^m = f^{m-1} + \varphi_j^m + 2 v_{jm}$ for m = j-1, j-2, ..., 1.
Find the minimum of $\psi^m$. Let this be $\psi^r$.

6. - Form $\Delta = \psi^r - \psi^j$, where
$\psi^j = f^{j-1} + \varphi_j^j$. If $\Delta \ge 0$ go to step 8. Else
perform (j | r).

7. - Update $T_1$ and $T_2$ and $T_3$.

8. - If j = N, stop. Else replace j with j+1 and return to step 2.
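The steps above can be sketched as one left-to-right scan. This is a deliberately naive illustration and not the tabular procedure of the paper: it rebuilds $f^j$ from scratch for every j instead of updating $T_3$, ties are resolved towards the leftmost cell, and the function names and the 6-document test graph (two interleaved triangles) are assumptions.

```python
def stretching(edges, layout):
    pos = {doc: cell for cell, doc in enumerate(layout) if cell > 0}
    return sum(abs(pos[a] - pos[b]) for a, b in edges)

def left_to_right_pass(edges, layout):
    """One scan of steps 1-8; layout[0] is padding, layout[j] is the document in cell j."""
    layout = list(layout)
    n = len(layout) - 1
    adj = {doc: set() for doc in layout[1:]}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    for j in range(2, n + 1):                              # steps 1 and 8
        pos = {doc: cell for cell, doc in enumerate(layout) if cell > 0}
        f = [0] * (n + 1)                                  # rebuilt here, not updated
        for a, b in edges:
            lo, hi = sorted((pos[a], pos[b]))
            for i in range(lo, hi):
                f[i] += 1
        nbrs = [pos[d] for d in adj[layout[j]]]            # steps 2-4

        def psi(m):                                        # step 5
            phi = sum(abs(m - h) for h in nbrs)
            v = sum(1 for h in nbrs if m <= h < j)
            return f[m - 1] + phi + 2 * v

        r = min(range(1, j + 1), key=psi)
        if psi(r) - psi(j) < 0:                            # step 6
            layout[r:j + 1] = [layout[j]] + layout[r:j]    # RCP (j | r), step 7
    return layout

# Two triangles {1, 3, 5} and {2, 4, 6}, stored interleaved.
edges = [(1, 3), (1, 5), (3, 5), (2, 4), (2, 6), (4, 6)]
start = [None, 1, 2, 3, 4, 5, 6]
result = left_to_right_pass(edges, start)
print(stretching(edges, start), stretching(edges, result), result[1:])
```

On this small input a single pass brings each triangle into three adjacent cells and halves the total stretching, from 16 to 8.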

Application of Algorithm 1 certainly satisfies the requirement that
the current value of S be monotonically non-increasing. In substance, with
Algorithm 1 we scan U from left to right one cell at each step and determine
whether the document contained in the last scanned cell can be "brought"
to the left through an RCP resulting in a net decrease of S. After
completion of a left-to-right scanning of U, a right-to-left scanning is
performed, to possibly relocate documents for which $\Delta S \ge 0$ was
obtained during the former scanning: this completes a processing cycle.

Example: Let us consider again the citation graph $G$ of Fig. 1
and its initial assignment shown in Fig. 2 (S = 56). We now perform a
left-to-right pass in the application of Algorithm 1 to U. The result of
this processing is shown in Fig. 3: we also have S = 34.


Fig. 3 - After a left-to-right pass

We then perform a right-to-left pass, which yields the assignment
shown in Fig. 4 (S = 30).


However simple the example may be, the effectiveness of the algorithm
is apparent already after a single pass: the two clusters of $G$ are in fact
already identifiable. The performance of a second pass also shows that we
are approaching a point of diminishing return in the attempt to reduce S:
the application of the algorithm may reasonably stop after obtaining the
assignment of Figure 4.
If in a processing cycle we find that no cyclic permutation can
be performed, the processing is terminated: in this case we have reached
a local minimum of S. At this point there is no evidence whether the
obtained minimum is also the absolute minimum: as a matter of fact it is
possible to concoct some clever "interlocking" configurations which
correspond to relative minima. It seems reasonable, however, to introduce
at this stage a perturbation of the reached assignment in the form of a
single random permutation of the same, and then apply Algorithm 1 to the
new assignment. It is likely that this device may lead out of the trap
represented by a local minimum.

Leaving this rather important question, some comment is necessary
with regard to the computational complexity of Algorithm 1. Steps 2, 3, 4,
6, 8 require a fixed amount of computation per document processed. Step 5,
however, requires the calculation of $\psi^m$ for each m < j and therefore
its complexity is proportional to j. Step 7, in the case that a permutation
is performed, requires the updating of $T_1$, $T_2$ and $T_3$, the
complexity of which is proportional to $(j - r_j)$ for some $r_j < j$. Let
$r_j = \alpha_j\, j$ with $0 < \alpha_j < 1$. Then let $C_i(j)$ be the
computational complexity of the i-th step in processing the j-th document.
The total complexity C of one scanning is therefore

$$ C = \sum_{j=1}^{N} \sum_{i} C_i(j) $$
Obviously, for some $0 < \alpha < 1$,

$$ \sum_{j=1}^{N} \alpha_j\, j = \frac{\alpha}{2}\,(N^2 + N) $$

hence, letting a and b denote the resulting combinations of the step
constants $c_2, \ldots, c_8$ and of $\alpha$, we have

$$ C = b N^2 + a N $$

which shows a square law rate of growth for C. To avoid this undesirable
feature we propose the introduction of some approximations in Algorithm 1.
We notice that only steps 5 and 7 contribute to the term in $N^2$: these we
want to modify.

Let us first consider step 7. Assume that a permutation (u | t)
has been performed with u < s. Without updating the function $f^j$ for
$t \le j < u$ and the location of the documents contained in cells
t, t+1, ..., u, let us compute the quantity

$$ \Delta S^*_{sr} = (\varphi_s^{*r} - \varphi_s^{*s}) + (f^{*\,r-1} - f^{*\,s-1}) + 4 v^*_{sr} \tag{6} $$

(The symbols are asterisked to denote that the functions are relative to
the assignment before the permutation (u | t)). Then if t > r
(see Fig. 5)


Fig. 5

we have that $f^{s-1} = f^{*\,s-1}$, $f^{r-1} = f^{*\,r-1}$ and
$v_{sr} = v^*_{sr}$. Assume then that s is linked to $\mu$ cells in the
interval (t, t+1, ..., u-1), and that $\delta = 1, 0$ if s is or is not
linked to u, respectively. We have
$$ \varphi_s^{*r} = \varphi_s^{r} - \mu + \delta(u-t+1) $$

$$ \varphi_s^{*s} = \varphi_s^{s} + \mu - \delta(u-t+1) $$

It follows that

$$ \Delta S^*_{sr} = (\varphi_s^{r} - \varphi_s^{s}) + (f^{r-1} - f^{s-1}) - 2\mu + 2\delta(u-t+1) + 4 v_{sr} $$

Obviously $v_{sr} \ge \mu$ and $\delta(u-t+1) \ge 0$. Hence

$$ \Delta S^*_{sr} \ge (\varphi_s^{r} - \varphi_s^{s}) + (f^{r-1} - f^{s-1}) + 2 v_{sr} + 2\delta(u-t+1) \ge \Delta S_{sr} $$

It follows that if $\Delta S^*_{sr} < 0$, then also $\Delta S_{sr} < 0$:
our criterion can therefore be applied to $\Delta S^*_{sr}$ as given by
eq. (6). If t < r, the same argument can be applied with the only exception
that now $f^{r-1} \ne f^{*\,r-1}$. We notice that due to the permutation
(u | t)

$$ \sum_j f^{j} < \sum_j f^{*j} $$

i.e. the function $f^j$ is decreasing on the average. Hence we feel
justified in the conservative approximation $f^{r-1} \approx f^{*\,r-1}$,
and conclude that

$$ \Delta S_{sr} \le \Delta S^*_{sr} $$

If now $\Delta S^*_{sr}$ is computed with reference to the functions which
were current before the performance of $\nu$ permutations, relation (6) is
easily generalized:

$$ \Delta S^*_{sr} = (\varphi_s^{*r} - \varphi_s^{*s}) + (f^{*\,r-1} - f^{*\,s-1}) + 2(\nu+1)\, v^*_{sr} $$

We see therefore that, if a permutation is decided with reference to
$\Delta S^*_{sr}$, the updating of $T_1$, $T_2$, $T_3$ can be performed
after $\nu$ executed permutations.
The ensuing updating procedure could be a modification of the one expressed
by relations (5): it appears simpler, however, to proceed through the
reevaluation of $s^j$ for every cell j affected by any of the $\nu$
permutations. Specifically, let these permutations be $(s_i \mid r_i)$
($i = 1, 2, \ldots, \nu$) and let

$$ \bar{s} = \max_i s_i, \qquad \bar{r} = \min_i r_i; $$

then for $\bar{r} \le j \le \bar{s}$, compute $s^j$ and $f^j$. Obviously,
$\nu$ must be rather small to avoid gross approximations. A convenient
spacing of the updating runs is provided by the following discussion of
step 5.

In the search for the minimum of

$$ \psi^{*m} = f^{*\,m-1} + \varphi_j^{*m} + 2(\nu+1)\, v^*_{jm} $$

after $\nu$ permutations not followed by updating, assume that the interval
(1, j) is subdivided into the following segments:

(1, a), (a+1, 2a), ..., (ha+1, j)

where $h = [j/a]$, i.e. the highest integer smaller than j/a. Let
$I_s = [sa+1, (s+1)a]$ and $g_s = \min f^{*\,m-1}$ for $m \in I_s$. Also,
let $\bar{\varphi}_{js} = \max\,[\varphi_j^{*m} + 2(\nu+1)\, v^*_{jm}]$ for
$m \in I_s$. It follows that

$$ \min_{m \in I_s} \psi^{*m} \le g_s + \bar{\varphi}_{js} \equiv \psi_s $$

The function $\psi_s$ is taken as an indication of the values of
$\psi^{*m}$ in $I_s$, the better the approximation the smaller the
parameter a. Then if

$$ \psi_n = \min_s \psi_s, \qquad s = 0, 1, \ldots, h $$

we perform the calculation of $\psi^{*m}$ for $m \in I_n$.
In summary, the search for $\min \psi^{*m}$ entails the computation of
$\psi_s$ for the h segments and of $\psi^{*m}$ for a different values of m.
Hence its complexity is given by

$$ c_5\, h + c_5'\, a \approx c_5\, \frac{j}{a} + c_5'\, a $$

The choice of a as a function of j is crucial with regard to the complexity
of step 5 over a complete scanning of the collection. It is very simple to
show that the quantity

$$ c_5\, \frac{j}{a} + c_5'\, a $$

for fixed j and variable a attains its lowest value for

$$ a = \sqrt{c_5\, j / c_5'} $$
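The minimizing value of a follows from a short calculus argument, sketched here under the assumption that a is treated as a continuous positive variable (in the algorithm it is an integer segment length):

```latex
\begin{align*}
g(a) &= c_5\,\frac{j}{a} + c_5'\,a, \qquad a > 0,\\
g'(a) &= -c_5\,\frac{j}{a^{2}} + c_5' = 0
  \;\Longrightarrow\; a_{\min} = \sqrt{\frac{c_5\,j}{c_5'}},\\
g''(a) &= \frac{2\,c_5\,j}{a^{3}} > 0
  \quad\text{(so the stationary point is a minimum)},\\
g(a_{\min}) &= 2\sqrt{c_5\,c_5'\,j}.
\end{align*}
```

The cost of step 5 for the j-th document therefore grows as the square root of j rather than as j, which is what makes a segment length close to $\sqrt{j}$ attractive.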


With the insight provided by this relation we subdivide the interval (1, N)
into the following set of segments:

(1,2), (3,6), (7,12), ..., ($p^2 - p + 1$, $p^2 + p$), ...

Let $K_p = (p^2 - p + 1,\ p^2 + p)$ and $p_{max}$ be the smallest p such
that $p^2 + p \ge N$ or, approximately, $p_{max} \approx \sqrt{N}$. For
each $j \in K_p$, a = p. Hence the computational complexity of step 5 for
the totality of $j \in K_p$ is

$$ c_5^{(p)} = 2(c_5 + c_5')\, p^2 - (2c_5' + c_5)\, p $$

which summed over all p's yields approximately

$$ \tfrac{2}{3}\,(c_5 + c_5')\, N^{3/2} - \tfrac{1}{2}\,(2c_5' + c_5)\, N $$
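The growth rate of the segmented search can be checked numerically. The sketch below assumes unit constants $c_5 = c_5' = 1$ and the cost model $c_5\, j/a + c_5'\, a$ with a = p on segment $K_p$; the function names are hypothetical. With these constants the leading term of the sum is $\tfrac{4}{3} N^{3/2}$.

```python
import math

def segment_index(j):
    """Return p such that j lies in K_p = (p*p - p + 1, p*p + p)."""
    p = math.isqrt(j)
    return p + 1 if j > p * p + p else p

def step5_cost(n, c5=1.0, c5_prime=1.0):
    """Sum c5 * j / a + c5' * a over j = 1..n, taking a = p on segment K_p."""
    total = 0.0
    for j in range(1, n + 1):
        a = segment_index(j)
        total += c5 * j / a + c5_prime * a
    return total

n = 10_000
ratio = step5_cost(n) / ((4.0 / 3.0) * n ** 1.5)
print([segment_index(j) for j in (1, 2, 3, 6, 7, 12, 13)], round(ratio, 3))
```

A ratio close to 1 indicates that, under this cost model, the total work of step 5 over one scan follows the $N^{3/2}$ law instead of the square law of the unmodified algorithm.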
Analogously, assume that we perform a fixed number w of updating runs
per interval $K_p$. The complexity of an updating run is proportional to
$p^2$, that is

$$ c_7^{*}\, p^2 $$

which summed over all p's yields a term that again grows as $N^{3/2}$.


In summary, with the artifices introduced to modify the search
for the minimum of $\Delta S$ and the updating procedures of the pertinent
functions, we have a computational procedure whose complexity grows only as
$N^{3/2}$.